Introduction to Open Data Science - Course Project

About the project

The course is about learning to use R for statistics, with an emphasis on using it together with GitHub. My objective for attending the course is to refresh my knowledge of R and statistics in general. The course has been on the "TODO" list of my PhD studies for a while.

I think the R Data Health book works well as an introduction, although - as I'm not a complete novice - I didn't read the chapters very thoroughly; there's a lot to take in in a short time (I missed the exercises in chapter 3, but not chapter 4). Perhaps the book works more as a sampler now and as a reference later.

There have been quite a few new things since I last used R, like the tidyverse pipe. I’m glad there is a possibility of using the “language-agnostic standard” of using = for assignment instead of (or in addition to) <-, which I’ve always found odd. A lot of things are similar to features in pandas in Python data analysis (not surprising, as pandas has taken quite a bit of inspiration from R).
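As a quick illustration (my own toy example, not from the course material), the two assignment operators and the pipe:

```r
# '=' and '<-' are interchangeable for top-level assignment
x = c(1, 5, 3)
y <- sort(x)                  # 1 3 5

# the native pipe (R 4.1+) passes the left-hand side as the first argument;
# the magrittr pipe %>% from the tidyverse works much the same way
res <- x |> sort() |> rev()   # 5 3 1
```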

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Nov 27 17:21:00 2023"

GitHub repository: https://github.com/dustedmtl/IODS-project


Chapter 2: Regression and model validation

Data analysis

date()
## [1] "Mon Nov 27 17:21:00 2023"

The dataset that is read in contains averaged survey scores for three learning approaches (deep, surface and strategic), along with student attitude, exam points, age, and gender. There are 166 students.

lr2 <- read.table('data/learning2014.csv', sep=',', header = T)
dim(lr2)
## [1] 166   7
summary(lr2)
##     gender               age           attitude         points     
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   : 7.00  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00  
##  Mode  :character   Median :22.00   Median :32.00   Median :23.00  
##                     Mean   :25.51   Mean   :31.43   Mean   :22.72  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75  
##                     Max.   :55.00   Max.   :50.00   Max.   :33.00  
##       deep            surf            stra      
##  Min.   :1.583   Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.333   1st Qu.:2.417   1st Qu.:2.625  
##  Median :3.667   Median :2.833   Median :3.188  
##  Mean   :3.680   Mean   :2.787   Mean   :3.121  
##  3rd Qu.:4.083   3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.917   Max.   :4.333   Max.   :5.000
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
p <- ggpairs(lr2, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

The score variables deep, surf and stra are distributed between 1 and 5.

The ggpairs plot above also shows the pairwise correlations in the data.

Linear regression

Linear regression is used to determine the relation between a response variable and one or more (in multiple regression) explanatory variables. For a single explanatory variable, the equation to fit is:

\(Y = \alpha + \beta X + \epsilon\)

where all the terms are \(N\)-dimensional vectors (for \(N\) observations). \(Y\) is the response variable and \(X\) is the explanatory variable. The error term \(\epsilon\) is assumed to be normally distributed with a mean of 0.

The task is to find values of \(\alpha\) and \(\beta\) that minimize the sum \(\sum_{i=1}^{N} \epsilon_i^2\). This is called the least squares method.

For multiple linear regression the equation is similarly \(Y = \alpha + X\beta + \epsilon\), where \(X\) is a matrix of dimensions \(N \times k\) (with \(k\) explanatory variables) instead of an \(N \times 1\) vector.
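As a sketch (with simulated data, not the course dataset), the single-variable least-squares solution can be computed directly from the closed form and checked against lm():

```r
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

# closed-form least-squares estimates for Y = alpha + beta*X + eps
beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha <- mean(y) - beta * mean(x)

fit <- lm(y ~ x)
# coef(fit) agrees with c(alpha, beta) up to floating-point error
```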

Linear regression model for variable points based on three explanatory variables: attitude, stra and surf:

model <- lm(points ~ attitude + stra + surf, data = lr2)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lr2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The model fit is assessed with an omnibus F-test, which tests the hypothesis that the coefficients of all three explanatory variables are zero. The very low p-value of the F-statistic (3.156e-08) means that this hypothesis is not likely to be true.

The p-value for attitude (well below 0.05) shows that it is a statistically significant predictor of exam points; stra and surf are not.

The multiple R-squared value of 0.2074 means that the three variables together account for about 21 % of the variation in points.

# the three diagnostic plots in one figure
par(mfrow = c(2,2))

plot(model, which = c(1,2,5))

Above are three diagnostic plots of the residuals, where a residual is the difference between the actual and the fitted value.

The first plot (residuals vs. fitted) is symmetrical, with no apparent correlation between the residuals and the fitted values, so a linear regression model is appropriate here.

The second plot (normal Q-Q) plots the standardized residuals against theoretical normal quantiles. The points follow the line closely, so the normality assumption of the errors also holds.

The last plot (residuals vs. leverage) uses standardized residuals, which have identical variance, to spot influential observations. No single point dominates, so the conclusion is similar to that of plot 1.


Chapter 3: Logistic regression

date()
## [1] "Mon Nov 27 17:21:02 2023"

In the previous section we covered linear regression. Here we consider some cases where its use may be inappropriate. In particular, linear regression relies on the response variable being normally distributed, and the response might not even be continuous. In the case of a binary response variable we prefer to use logistic regression:

\(\log(\frac{p}{1-p}) = \alpha + \beta X\)

where \(p\) is the probability of one outcome and \(1-p\) the probability of the other outcome, with \(\beta\) and \(X\) being the coefficient and variable vectors of the explanatory variables, combined as the sum \(\sum_k \beta_k x_k\).

Here \(\frac p {1-p}\) is the odds of one outcome against the other. The usefulness of odds over probabilities comes from the fact that the log-odds is a linear function of the explanatory variables, so the fitted lines are straight.
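As an aside (a toy check of my own, not course code), base R has the logistic transform built in: qlogis() maps a probability to log-odds and plogis() maps back:

```r
p <- 0.8
log_odds <- qlogis(p)        # log(0.8 / 0.2) = log(4), about 1.386
p_back   <- plogis(log_odds) # the inverse transform recovers p
```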

Data analysis

The data set for analysis comes from studies about Portuguese students’ alcohol consumption:

http://www.archive.ics.uci.edu/dataset/320/student+performance

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
alc <- read.table('data/alc_data.csv', sep=',', header = T)
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

The data has been combined from mathematics and Portuguese classes.

The data has a number of variables related to grading (G1, G2 and G3), absences, extra classes (paid) and previous failures (failures). It also has information about alcohol usage, including a boolean column for high alcohol usage (high_use). The rest are a variety of either numerical or categorical metadata columns.

The objective is to analyze the relation of alcohol consumption to the various categories. The hypothesis is that high alcohol consumption will lead to

  • lower grades (especially G3)

  • more absences

The student will also have spent less time studying (studytime) and will not have had paid extra classes (paid).

Viewing the data

library(tidyr); library(dplyr); library(ggplot2)

# install.packages("patchwork")
library(patchwork)

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 <- g1 + geom_boxplot() + ylab("final grade")

g11 <- ggplot(alc, aes(x = sex, y = G3, col = high_use))
g11 <- g11 + geom_boxplot() + ylab("final grade")

g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 <- g2 + geom_boxplot() + ylab("absences")

g3 <- ggplot(alc, aes(x = studytime, y = high_use))
g3 <- g3 + geom_col() + ylab("high use")

g4 <- ggplot(alc, aes(x = paid, y = high_use))
g4 <- g4 + geom_boxplot() + ylab("high use")

g1 + g11 + g2 + g3

Grades are lower and there are more absences with high alcohol usage, although the effect is more pronounced for men than women. Students with high alcohol consumption spent a lot less time on studying.

gh <- ggplot(alc, aes(x = studytime))
gh <- gh + geom_histogram()

# second histogram for absences, for comparison
gh2 <- ggplot(alc, aes(x = absences))
gh2 <- gh2 + geom_histogram()

gh + gh2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

table(high_use = alc$high_use, paid = alc$paid)
##         paid
## high_use  no yes
##    FALSE 140 119
##    TRUE   56  55

Those with low alcohol consumption used fewer paid classes, although the difference is not dramatic. My hypothesis was thus incorrect; it’s possible that students who use a lot of alcohol need the extra classes (perhaps to compensate for lack of study time?).

Logistic regression model

m <- glm(high_use ~ G3 + absences + studytime + paid, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ G3 + absences + studytime + paid, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.44301    0.51534   0.860 0.389984    
## G3          -0.06369    0.03725  -1.710 0.087317 .  
## absences     0.07688    0.02285   3.365 0.000766 ***
## studytime   -0.58041    0.16270  -3.567 0.000361 ***
## paidyes      0.40796    0.24619   1.657 0.097501 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 414.63  on 365  degrees of freedom
## AIC: 424.63
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)          G3    absences   studytime     paidyes 
##  0.44300723 -0.06368689  0.07687909 -0.58041234  0.40795610

The low p-values for absences and studytime indicate a statistically significant association with high alcohol usage. Surprisingly (?), grades do not show one.

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 1.5573836 0.5663598 4.305756
## G3          0.9382987 0.8717178 1.009274
## absences    1.0799115 1.0350021 1.132305
## studytime   0.5596675 0.4024071 0.762707
## paidyes     1.5037411 0.9304002 2.446804

For grades (G3) and paid classes, 1.0 is within the confidence interval, thus there is no evidence that these variables are associated with high alcohol usage.

Predictions

m2 <- glm(high_use ~ absences + studytime, data = alc, family = "binomial")

probabilities <- predict(m2, type = "response")

# add probability
alc2 <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)

table(high_use = alc2$high_use, prediction = alc2$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   11
##    TRUE     93   18
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67027027 0.02972973 0.70000000
##    TRUE  0.25135135 0.04864865 0.30000000
##    Sum   0.92162162 0.07837838 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the proportion of wrong predictions in the (training) data
loss_func(class = alc2$high_use, prob = alc2$prediction)
## [1] 0.2810811

The proportion of incorrect predictions is 28.1 %.

table(high_use = alc2$high_use, high_absences = alc2$absences > 1) %>% prop.table() %>% addmargins()
##         high_absences
## high_use      FALSE       TRUE        Sum
##    FALSE 0.23513514 0.46486486 0.70000000
##    TRUE  0.07027027 0.22972973 0.30000000
##    Sum   0.30540541 0.69459459 1.00000000
table(high_use = alc2$high_use, low_study = alc2$studytime < 2) %>% prop.table() %>% addmargins()
##         low_study
## high_use     FALSE      TRUE       Sum
##    FALSE 0.5486486 0.1513514 0.7000000
##    TRUE  0.1864865 0.1135135 0.3000000
##    Sum   0.7351351 0.2648649 1.0000000
table(high_use = alc2$high_use, low_study_high_absence = alc2$studytime < 2 | alc2$absences > 1) %>% prop.table() %>% addmargins()
##         low_study_high_absence
## high_use      FALSE       TRUE        Sum
##    FALSE 0.19189189 0.50810811 0.70000000
##    TRUE  0.05405405 0.24594595 0.30000000
##    Sum   0.24594595 0.75405405 1.00000000

Guessing based on simple metrics such as low study time or high absences does not produce good predictions.

Cross-validation

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2945946

The cross-validation error (29.5 %) is slightly higher than the training error; this is to be expected, since cross-validation estimates the error on unseen data rather than improving the fit.
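For reference, a hand-rolled sketch of what cv.glm() does internally (illustrated on the built-in mtcars data, since the course data may not be loaded here): split the rows into K folds, refit on K-1 folds and evaluate on the held-out fold.

```r
k_fold_error <- function(df, K = 5) {
  # assign each row to one of K folds at random
  folds <- sample(rep(1:K, length.out = nrow(df)))
  fold_errs <- sapply(1:K, function(k) {
    # fit on everything except fold k
    fit <- glm(am ~ mpg + wt, data = df[folds != k, ], family = "binomial")
    # predict probabilities on the held-out fold
    p <- predict(fit, newdata = df[folds == k, ], type = "response")
    # misclassification rate on fold k
    mean((p > 0.5) != df$am[folds == k])
  })
  mean(fold_errs)
}

set.seed(1)
err <- k_fold_error(mtcars)
```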


Chapter 4: Multivariate analysis, clustering and classification

date()
## [1] "Mon Nov 27 17:21:03 2023"

Multivariate analysis

In multivariate analysis several variables are analysed at the same time. The difference from analysing a single response variable is that you do not expect to "explain" one variable in terms of the others, but rather (often) explore connections between the variables.

It is useful to explore the relations between variables graphically, for example using boxplots and scatterplots. Normality of the data can be assessed with quantile (Q-Q) plots.

Clustering

The purpose of clustering is to group items into mutually exclusive groups. The members of a single group should be closer to each other than to members of other groups.

A common metric for determining the difference between items is the Euclidean distance:

\(d_{ij} = \sqrt{\sum_{k=1}^{q}(x_{ik}-x_{jk})^2}\)

Another option is the Manhattan distance, which assumes "block-wise" traversal, i.e. you can only move along a street grid (east-west or north-south).

The distance metrics are obviously sensitive to differing scales; for them to work, the data may have to be centered around the mean and normalized by the standard deviation:

\(scaled(x) = \frac{x - mean(x)}{ sd(x)}\)
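In R, scale() applies exactly this transformation column-wise; a small check on a toy vector (my own example):

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)     # center, then divide by standard deviation
auto   <- as.vector(scale(x))       # scale() returns a matrix; flatten it
# manual and auto are identical (up to floating point)
```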

Hierarchical clustering

In hierarchical clustering, items are iteratively merged into clusters, starting from single items and joining the closest (sub-)clusters at each step. There are three common ways to measure the distance between clusters:

  • Single linkage

  • Maximum linkage

  • Average linkage

In single linkage, the distance between clusters is the smallest distance between any members of the respective clusters. Maximum and average linkage use the maximum and average distance, respectively. Hierarchical clustering produces a tree-like dendrogram.
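A minimal sketch of the three linkages with hclust() on the built-in mtcars data (my own example, using scaled variables and Euclidean distances):

```r
d <- dist(scale(mtcars))                       # Euclidean distance matrix

hc_single   <- hclust(d, method = "single")    # single linkage
hc_complete <- hclust(d, method = "complete")  # maximum (complete) linkage
hc_average  <- hclust(d, method = "average")   # average linkage

# plot(hc_complete) would draw the dendrogram;
# cutree() extracts a flat clustering with k groups
groups <- cutree(hc_complete, k = 3)
```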

K-means clustering

The k-means algorithm seeks to group items into \(k\) clusters based on some criterion, typically the within-cluster sum of squared distances. For a large amount of data, computing all pairwise distances may be computationally infeasible. An alternative is to choose some initial clustering and then iteratively make small changes to it, only keeping those changes that improve whatever criterion is used to measure "best fit".
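A sketch of this with kmeans() on the built-in iris measurements (not the course data); the total within-cluster sum of squares is the criterion being minimized:

```r
data <- scale(iris[, 1:4])   # scale first, as with the distance metrics above
set.seed(42)

# total within-cluster sum of squares for k = 1..5;
# a "knee" in this curve suggests a reasonable number of clusters
wss <- sapply(1:5, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)

km <- kmeans(data, centers = 3, nstart = 10)
# km$cluster assigns each row to one of the 3 clusters
```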

Data analysis

We are analysing the Boston housing data set and the relations between its variables. The details about the data set are found here:

https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
## corrplot 0.92 loaded
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

In addition to the median house price, the variables include crime rate, taxes, proportion of the Black population, and measures of access to education, employment and transportation, among other things.

Renaming the variables to be more descriptive:

names(Boston) = c("crime", "zone.pct", "ind.pct",
                  "river", "nox.ppm",
                  "rooms.avg", "age.pct",
                  "dist", "roads.idx", "tax",
                  "pupil.ratio", "black",
                  "stat.pct", "price")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crime      : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zone.pct   : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ ind.pct    : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ river      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox.ppm    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rooms.avg  : num  6.58 6.42 7.18 7 7.15 ...
##  $ age.pct    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dist       : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ roads.idx  : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax        : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ pupil.ratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black      : num  397 397 393 395 397 ...
##  $ stat.pct   : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ price      : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
##     crime zone.pct ind.pct river nox.ppm rooms.avg age.pct   dist roads.idx tax
## 1 0.00632       18    2.31     0   0.538     6.575    65.2 4.0900         1 296
## 2 0.02731        0    7.07     0   0.469     6.421    78.9 4.9671         2 242
## 3 0.02729        0    7.07     0   0.469     7.185    61.1 4.9671         2 242
## 4 0.03237        0    2.18     0   0.458     6.998    45.8 6.0622         3 222
## 5 0.06905        0    2.18     0   0.458     7.147    54.2 6.0622         3 222
## 6 0.02985        0    2.18     0   0.458     6.430    58.7 6.0622         3 222
##   pupil.ratio  black stat.pct price
## 1        15.3 396.90     4.98  24.0
## 2        17.8 396.90     9.14  21.6
## 3        17.8 392.83     4.03  34.7
## 4        18.7 394.63     2.94  33.4
## 5        18.7 396.90     5.33  36.2
## 6        18.7 394.12     5.21  28.7

The ranges of the values vary: some variables are percentages, some are ratios, and the rest take a variety of positive values (mainly above 1).

cor_matrix <- cor(Boston) %>% round(2)  # round() without digits would round to integers
corrplot(cor_matrix,
         method="circle", type = "upper",
         cl.pos = "b",
         tl.pos = "d", tl.cex = 0.6)

The correlation matrix shows positive correlations between crime and both taxation and road access. The house price is positively correlated with the average number of rooms (there is no variable for house size itself) and negatively correlated with the pupil-teacher ratio (i.e. fewer teachers per pupil) and the lower-status percentage of the population.

Access to river doesn’t seem to be correlated with anything.

To be able to properly analyze the data, we need to scale it.

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- as.numeric(boston_scaled$crime)
summary(boston_scaled$crime)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110

We create a categorical variable for the crime rates.

bins <- quantile(boston_scaled$crime)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crime,
             breaks = bins,
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high")
             )
boston_scaled <- dplyr::select(boston_scaled, -crime)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
head(boston_scaled)
##     zone.pct    ind.pct      river    nox.ppm rooms.avg    age.pct     dist
## 1  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
##    roads.idx        tax pupil.ratio     black   stat.pct      price crime
## 1 -0.9818712 -0.6659492  -1.4575580 0.4406159 -1.0744990  0.1595278   low
## 2 -0.8670245 -0.9863534  -0.3027945 0.4406159 -0.4919525 -0.1014239   low
## 3 -0.8670245 -0.9863534  -0.3027945 0.3960351 -1.2075324  1.3229375   low
## 4 -0.7521778 -1.1050216   0.1129203 0.4157514 -1.3601708  1.1815886   low
## 5 -0.7521778 -1.1050216   0.1129203 0.4406159 -1.0254866  1.4860323   low
## 6 -0.7521778 -1.1050216   0.1129203 0.4101651 -1.0422909  0.6705582   low

The newly scaled variables have a mean of zero.

Train and test sets

n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Linear discriminant analysis (LDA)

LDA is used to reduce the dimensionality of the data. Here we want to see which variables are most relevant to the crime rate categories.

lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2623762 0.2376238 0.2500000 
## 
## Group means:
##            zone.pct    ind.pct       river    nox.ppm  rooms.avg    age.pct
## low       1.0130351 -0.8843534 -0.11640431 -0.8877748  0.4016937 -0.8892838
## med_low  -0.1223800 -0.2803620  0.02481057 -0.5439028 -0.0970797 -0.3520165
## med_high -0.3693281  0.2338000  0.21980846  0.4504817  0.1221638  0.3725393
## high     -0.4872402  1.0171306 -0.03844192  1.0728110 -0.3669600  0.7969423
##                dist  roads.idx        tax pupil.ratio      black    stat.pct
## low       0.9016778 -0.6919117 -0.7678746  -0.3750530  0.3787090 -0.77788777
## med_low   0.3032925 -0.5549882 -0.4764159  -0.0849146  0.3253694 -0.16423052
## med_high -0.4202793 -0.4207972 -0.3056799  -0.4072044  0.1008959 -0.01456376
## high     -0.8442123  1.6379981  1.5139626   0.7806252 -0.8084572  0.94303396
##                price
## low       0.50261900
## med_low   0.04269446
## med_high  0.23914977
## high     -0.75208073
## 
## Coefficients of linear discriminants:
##                     LD1         LD2          LD3
## zone.pct     0.10508373  0.76695007 -0.996090748
## ind.pct      0.09060594 -0.21519401  0.039314143
## river       -0.08983175 -0.04335498  0.140253245
## nox.ppm      0.34706647 -0.77212168 -1.348407409
## rooms.avg   -0.08931250 -0.07466255  0.009250446
## age.pct      0.23625387 -0.15063742 -0.167613813
## dist        -0.06004226 -0.20979392  0.114391468
## roads.idx    3.47749262  1.03091451 -0.289407677
## tax         -0.06994686 -0.18498948  0.836434126
## pupil.ratio  0.11359242  0.06407304 -0.235309144
## black       -0.12366816  0.04762272  0.116339664
## stat.pct     0.20591197 -0.27736956  0.496760751
## price        0.18442180 -0.51471745 -0.213019111
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9505 0.0369 0.0126
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

The first discriminant explains 95 % of the between-group variance (proportion of trace). Its highest coefficient is for roads.idx, the index of accessibility to radial highways, followed by nitrogen oxide concentration, percentage of old houses and the lower-status percentage of the population.

# target classes as numeric
classes <- as.numeric(train$crime)

plot(lda.fit,
     dimen = 2,
     col = classes,
     pch = classes)
lda.arrows(lda.fit, myscale = 2)

The biplot confirms that road access has the most relevance to the crime rate categories.

lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      12        2    0
##   med_low    5      13        2    0
##   med_high   0      16       12    2
##   high       0       0        0   26
table(correct = correct_classes, predicted = lda.pred$class) %>% prop.table() %>% addmargins()
##           predicted
## correct           low    med_low   med_high       high        Sum
##   low      0.11764706 0.11764706 0.01960784 0.00000000 0.25490196
##   med_low  0.04901961 0.12745098 0.01960784 0.00000000 0.19607843
##   med_high 0.00000000 0.15686275 0.11764706 0.01960784 0.29411765
##   high     0.00000000 0.00000000 0.00000000 0.25490196 0.25490196
##   Sum      0.16666667 0.40196078 0.15686275 0.27450980 1.00000000

The method correctly classifies 61.8 % of the items (the sum of the diagonal of the proportion table):

sum(correct_classes == lda.pred$class) / length(correct_classes)
## [1] 0.6176471

K-means

Distances

# center and standardize variables
boston_scaled2 <- scale(Boston)

# the columns already carry the descriptive names set earlier;
# note that for a matrix the correct accessor is colnames(), not names()
colnames(boston_scaled2) = c("crime", "zone.pct", "ind.pct",
                  "river", "nox.ppm",
                  "rooms.avg", "age.pct",
                  "dist", "roads.idx", "tax",
                  "pupil.ratio", "black",
                  "stat.pct", "price")

# change the object to data frame
boston_scaled2 = as.data.frame(boston_scaled2)

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Above are summaries of the Euclidean and Manhattan distances for the scaled dataset; the Manhattan distances are larger, as expected.

km <- kmeans(boston_scaled2, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

Bonus

km2 <- kmeans(Boston, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston, col = km2$cluster)

Super-Bonus exercise

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1,
        y = matrix_product$LD2,
        z = matrix_product$LD3,
        type= 'scatter3d', mode='markers',
        color=train$crime,
        )

The 3D plot shows some clustering based on the crime categories.


(more chapters to be added similarly as we proceed with the course!)